feat: add median and count_distinct aggregation functions #278

vibhatha · 2022-08-04T05:42:19Z

This PR adds YAML definitions for the median, approx_median and count_distinct aggregate functions.

extensions/functions_arithmetic.yaml

vibhatha · 2022-08-23T09:19:37Z

@jvanstraten would it be better to introduce an option like precision, with values EXACT and APPROXIMATE, instead of writing two separate functions? I did this today for another PR. I'm not sure if it is a good way to handle it. WDYT?

jvanstraten

Sorry, I forgot to update this.

would it be better to introduce an option like precision, with values EXACT and APPROXIMATE, instead of writing two separate functions

I don't really have an opinion, but unless there's a statistical definition I'm not aware of, I do believe we need to specify what "approximate" means. Is 100.0 an acceptable approximate median of [101.2, 101.3, 101.4]? Probably not, so where's the limit?

jvanstraten · 2022-08-23T12:24:28Z

extensions/functions_arithmetic.yaml

@@ -742,6 +742,105 @@ aggregate_functions:
          - value: fp64
        nullability: DECLARED_OUTPUT
        return: fp64?
+  - name: "median"
+    description: Calculate the median for a set of values.


IMO this should include something like "Returns null if applied to zero records. For the integer implementations, the rounding option determines how the median should be rounded if it ends up midway between two values. For the floating point implementations, it specifies the usual floating point rounding mode."

You mean the description?
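
As a minimal sketch of the behavior proposed above for the integer implementations (not the actual Substrait definition): the function returns null for zero records and consults a rounding option only when the median falls midway between two values. The function name `median_int` and the option names `FLOOR` and `CEILING` are assumptions made for this illustration.

```python
from typing import Optional, Sequence


def median_int(values: Sequence[int], rounding: str = "FLOOR") -> Optional[int]:
    """Median of integers; `rounding` only matters for a midway result."""
    if not values:
        return None  # zero records -> null
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]  # odd count: the middle element is exact
    total = ordered[mid - 1] + ordered[mid]
    if total % 2 == 0:
        return total // 2  # mean of the two middle values is already an integer
    # Midway between two integers: the (hypothetical) rounding option decides.
    if rounding == "FLOOR":
        return total // 2  # floor division rounds toward negative infinity
    if rounding == "CEILING":
        return total // 2 + 1
    raise ValueError(f"unsupported rounding option: {rounding}")
```

With this sketch, `median_int([1, 2], "FLOOR")` and `median_int([1, 2], "CEILING")` differ, while `median_int([1, 3], ...)` does not depend on the option at all.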

jvanstraten · 2022-09-02T13:11:12Z

@vibhatha Please extend the descriptions. Here's a good example of the level of descriptiveness I'm looking for:

substrait/extensions/functions_string.yaml

Lines 571 to 608 in add698f

    
    name: replace_slice
    description: >-
      Replace a slice of the input string.  A specified 'length' of characters will be deleted from
      the input string beginning at the 'start' position and will be replaced by a new string.  A
      start value of 1 indicates the first character of the input string. If start is negative
      or zero, or greater than the length of the input string, a null string is returned. If 'length'
      is negative, a null string is returned.  If 'length' is zero, inserting of the new string
      occurs at the specified 'start' position and no characters are deleted. If 'length' is
      greater than the input string, deletion will occur up to the last character of the input string.
    impls:
      - args:
          - value: "string"
            name: "input"
            description: Input string.
          - value: i64
            name: "start"
            description: The position in the string to start deleting/inserting characters.
          - value: i64
            name: "length"
            description: The number of characters to delete from the input string.
          - value: "string"
            name: "replacement"
            description: The new string to insert at the start position.
        return: "string"
      - args:
          - value: "varchar<L1>"
            name: "input"
            description: Input string.
          - value: i64
            name: "start"
            description: The position in the string to start deleting/inserting characters.
          - value: i64
            name: "length"
            description: The number of characters to delete from the input string.
          - value: "varchar<L2>"
            name: "replacement"
            description: The new string to insert at the start position.
        return: "varchar<L1>"

Better to be verbose and explicit than to assume the reader already knows what the function does, especially in odd corner cases, and in this case for what "approximate" actually means in terms of accuracy.

Also, IIRC we went for an option for approximate vs ~~precise~~ exact for other statistical functions? Not sure if they're merged yet but I remember that being the conclusion.

vibhatha · 2022-09-05T01:05:38Z

@jvanstraten we had a function for population and sample, and I recently wrote an option for precision. I am going to add it here as well. Let me push what I wrote; then we can probably reword it and standardize it for reuse.

I am also wondering whether we can reference such options under a general options attribute, rather than rewriting the description and definitions over and over again. It could save space and improve readability. I am not sure if this is possible; it is just a suggestion. I am not asking for an array of items, just:

operator_options:
     - population
     - precision
     - ....

And the operator_options can be read from somewhere else. How about that?

jvanstraten

I am also wondering whether we can reference such options under a general options attribute, rather than rewriting the description and definitions over and over again. It could save space and improve readability.

We've run into similar things with the YAMLs before, and the conclusion has always been that it's easier on consumers of the YAML files to just repeat everything. If at some point it becomes really annoying we could always just generate the YAML files. However, the descriptions of the standard options we use all over the core extensions (like overflow, rounding, and I guess precision and population) could IMO just be defined on the website instead of in the YAMLs. A computer that's parsing these YAML files won't care about the description anyway. There's no page for this yet, but I'm planning to rewrite/add to the extension pages of the website soon, and there will absolutely be a section for this. So I'm fine with just omitting the descriptions of precision for now.

jvanstraten · 2022-09-05T10:59:02Z

extensions/functions_arithmetic.yaml

+              on saving memory bandwidth, the precision of the end result can be
+              the highest possible accuracy of an approximation.
+
+                - EXACT: provides the highest accurate output


Suggested change

-                - EXACT: provides the highest accurate output
+                - EXACT: provides the exact result, rounded if needed according
+                  to the rounding option

jvanstraten · 2022-09-05T11:02:34Z

extensions/functions_arithmetic.yaml

+              the highest possible accuracy of an approximation.
+
+                - EXACT: provides the highest accurate output
+                - APPROXIMATE: provides a sub-optimal output


This still doesn't specify how approximate the result may be. Is this what you mean, or is this too broad?

Suggested change

-                - APPROXIMATE: provides a sub-optimal output
+                - APPROXIMATE: provides only an estimate; the result must lie
+                  between the minimum and maximum values in the input
+                  (inclusive), but otherwise the accuracy is left up to the
+                  consumer

This is what I meant, but maybe we could elaborate on it a bit?

That said, it is broad because the optimization strategy is very hard to specify. I think the choice of approximation is up to the engine and depends on various optimization techniques. It is not Substrait's job to define that strategy; Substrait can only specify that the result is going to be an approximation. That is the idea I wanted to put forward.

Should we elaborate further?

I don't know that we can constrain the approximation method any further than what I suggested, but you're welcome to try. I can't think of any approximate median algorithm that doesn't at least satisfy that constraint, and having only trivial constraints is better than having none, IMO.

Let's keep the suggested one for now. I will update this.
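
To make the agreed APPROXIMATE constraint concrete, here is a minimal sketch of a check that an estimate lies between the minimum and maximum input values (inclusive). The helper name `satisfies_approximate_constraint` is hypothetical, and treating an empty input as requiring a null result is an assumption taken from the description wording proposed earlier in this thread.

```python
import math
from typing import Optional, Sequence


def satisfies_approximate_constraint(
    values: Sequence[float], estimate: Optional[float]
) -> bool:
    """Return True if `estimate` is an allowed APPROXIMATE median of `values`."""
    if not values:
        # Assumption: zero records must yield null.
        return estimate is None
    if estimate is None or math.isnan(estimate):
        return False
    # The only constraint: min(input) <= estimate <= max(input).
    return min(values) <= estimate <= max(values)
```

Under this check, 100.0 is rejected as an approximate median of [101.2, 101.3, 101.4], while any value from 101.2 through 101.4 is accepted; how close the estimate is to the true median remains up to the consumer.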

vibhatha · 2022-09-05T13:51:53Z

I am also wondering whether we can reference such options under a general options attribute, rather than rewriting the description and definitions over and over again. It could save space and improve readability.

We've run into similar things with the YAMLs before, and the conclusion has always been that it's easier on consumers of the YAML files to just repeat everything. If at some point it becomes really annoying we could always just generate the YAML files. However, the descriptions of the standard options we use all over the core extensions (like overflow, rounding, and I guess precision and population) could IMO just be defined on the website instead of in the YAMLs. A computer that's parsing these YAML files won't care about the description anyway. There's no page for this yet, but I'm planning to rewrite/add to the extension pages of the website soon, and there will absolutely be a section for this. So I'm fine with just omitting the descriptions of precision for now.

I get your point. I think this sounds good for now.

jacques-n · 2022-09-05T16:31:27Z

Sorry, just saw this come through. I think we should revert the addition of count distinct here. Distinct isn't generally a property of the function, it is a property of the aggregate. That's why we have distinct as a property of function invocation.

You achieve count distinct by combining count + that property.
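
The point above can be illustrated with a small sketch in which distinctness belongs to the invocation rather than to the function. The names `count` and `invoke_aggregate` and the boolean `distinct` parameter are hypothetical stand-ins for illustration, and the assumption that count skips nulls is mine rather than something stated in this thread.

```python
from typing import Callable, Iterable, List, Optional


def count(values: Iterable[Optional[int]]) -> int:
    """Plain count: number of non-null values (assumed null handling)."""
    return sum(1 for v in values if v is not None)


def invoke_aggregate(
    func: Callable[[List[Optional[int]]], int],
    values: Iterable[Optional[int]],
    distinct: bool = False,
) -> int:
    """Apply `func`; `distinct` is a property of the invocation, not of `func`."""
    rows = list(values)
    if distinct:
        # dict.fromkeys drops duplicates while preserving first-seen order.
        rows = list(dict.fromkeys(rows))
    return func(rows)
```

Here `invoke_aggregate(count, rows, distinct=True)` plays the role of count distinct without a separate `count_distinct` function, and the same flag would work unchanged for any other aggregate.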

jvanstraten · 2022-09-05T17:45:57Z

Oops, you're right, I didn't think of that. #311

vibhatha marked this pull request as ready for review August 4, 2022 05:53

cpcloud reviewed Aug 10, 2022

cpcloud reviewed Aug 10, 2022

vibhatha requested review from cpcloud and jvanstraten August 15, 2022 11:59

vibhatha force-pushed the aggregate_meadian_count branch from e641541 to 4a4921c Compare August 15, 2022 12:01

jvanstraten requested changes Aug 23, 2022

jvanstraten changed the title ~~feat: adding median, approx_median and count_distinct for aggregation functions~~ feat: add median, approx_median and count_distinct for aggregation functions Sep 2, 2022

vibhatha force-pushed the aggregate_meadian_count branch from 4c30d4a to d2dc8de Compare September 5, 2022 01:15

vibhatha requested review from jvanstraten and removed request for cpcloud September 5, 2022 01:15

jvanstraten requested changes Sep 5, 2022

vibhatha added 5 commits September 5, 2022 20:47

feat(aggregates): adding median, approx_median and count_distinct functions (96e7698)

fix(rounding): adding rounding options (72c0c68)

fix(typo): added period and removed decomposition (5757b98)

fix(review): added precision option and refactored description (568f5fd)

fix(description): address reviews and update the description (7db8984)

vibhatha force-pushed the aggregate_meadian_count branch from e7e68fc to 7db8984 Compare September 5, 2022 15:26

vibhatha requested review from jvanstraten and cpcloud and removed request for jvanstraten and cpcloud September 5, 2022 15:27

jvanstraten approved these changes Sep 5, 2022

jvanstraten changed the title ~~feat: add median, approx_median and count_distinct for aggregation functions~~ feat: add median and count_distinct aggregation functions Sep 5, 2022

jvanstraten merged commit 9be62e5 into substrait-io:main Sep 5, 2022

ianmcook mentioned this pull request Sep 6, 2022

Implement approx_count_distinct via options to count #313

Closed
